Horizontal Data Integration

Author

Constantin Ahlmann-Eltze

Published

December 8, 2023

Figure 1 from Computational principles and challenges in single-cell data integration by Argelaguet et al. (2021). Horizontal data integration is concerned with relating cells measured in different conditions or batches where we have the same features (i.e., genes) for the cells.

To start we will load the tidyverse packages and SingleCellExperiment:

library(tidyverse)
library(SingleCellExperiment)

Example data

For this tutorial, we will use a popular dataset by Kang et al. (2018). The dataset measured the effect of interferon-\(\beta\) stimulation on blood cells from eight patients. The muscData package provides an easy way to access the data as a SingleCellExperiment.

sce <- muscData::Kang18_8vs8()
see ?muscData and browseVignettes('muscData') for documentation
loading from cache
sce
class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

We log-transform the data to account for the heteroskedasticity, perform PCA to reduce the dimensions, and run UMAP for visualization. For the preprocessing, we will use scater package which adds a new assay called "logcounts" and two reducedDims(sce) called "PCA" and "UMAP" . Equivalent steps exist in Seurat or scanpy.

sce <- scater::logNormCounts(sce)
hvg <- order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
sce <- sce[hvg[1:500], ]
sce <- scater::runPCA(sce, ncomponents = 50)
sce <- scater::runUMAP(sce, dimred = "PCA")

To visualize the data, we use ggplot2

as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Figure 1: UMAP of log transformed counts

The following code is based on Seurat’s Guided Clustering Tutorial.

# For more information about the conversion see `?as.Seurat.CellDataSet`
seur_obj <- Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj <- Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj <- Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
# Subset to highly variable genes for memory efficiency
seur_obj <- seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj <- Seurat::ScaleData(seur_obj)
seur_obj <- Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj <-Seurat::RunUMAP(seur_obj, dims = 1:10)
Seurat::DimPlot(seur_obj, reduction = "umap", group.by = "stim")

Figure 1 shows that the data separates by the treatment status. For many downstream analyses, it would be good to know how the cells from the stimulated condition are related to the cells from the control condition. For example for cell type assignment, we might want to annotate both conditions together and want to discount the effect of the treatment. This process is called integration.

The goal is get a low-dimensional embedding of the cells where the stimulation does not affect the embedding and all residual variance comes from different cell states. Figure 2 shows a sucessfully integrated example.

Figure 2: UMAP of a succesfully integrated dataset.

Integration approaches

There are many methods for single-cell data integration and Luecken et al. (2022) benchmarked several approaches. Here, I will present four integration methods which are easy to use from R and cover a useful set of features:

  • Manual projection
  • Automated integration
    • Harmony
    • MNN
  • Invertible integration
    • LEMUR

Manual Projection

Schematic picture of data from two conditions with the linear subspace that approximates the control condition
ctrl_mat <- logcounts(sce)[,sce$stim == "ctrl"]
stim_mat <- logcounts(sce)[,sce$stim == "stim"]

ctrl_centers <- rowMeans(ctrl_mat)
stim_centers <- rowMeans(stim_mat)

ctrl_pca <- irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20, center = FALSE)
ctrl_proj <- t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
stim_proj <- t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)

manual_proj <- matrix(NA, nrow = 20, ncol = ncol(sce))
manual_proj[,sce$stim == "ctrl"] <- as.matrix(ctrl_proj)
manual_proj[,sce$stim == "stim"] <- as.matrix(stim_proj)
manual_proj_umap <- uwot::umap(t(manual_proj))

as_tibble(colData(sce)) |>
  mutate(umap = manual_proj_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Schematic picture of data from two conditions using the stimulated condition as reference.

The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.

stim_pca <- irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)
ctrl_proj2 <- t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
stim_proj2 <- t(stim_pca$rotation) %*% (stim_mat - stim_centers)

manual_proj2 <- matrix(NA, nrow = 20, ncol = ncol(sce))
manual_proj2[,sce$stim == "ctrl"] <- as.matrix(ctrl_proj2)
manual_proj2[,sce$stim == "stim"] <- as.matrix(stim_proj2)

manual_proj_umap2 <- uwot::umap(t(manual_proj2))

as_tibble(colData(sce)) |>
  mutate(umap = manual_proj_umap2) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

For this example, using the "stim" condition as the reference leads to a worse integration.

The projection approach consists of three steps:

  1. Centering the data (e.g., ctrl_mat - ctrl_centers).
  2. Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation).
  3. Projecting the data from the other conditions onto that subspace (t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)).

For arbitrary experimental designs, we can perform the centering with a linear model fit:

# A complex experimental design
lm_fit <- lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
centered_mat <- t(residuals(lm_fit))
# Assuming that `is_reference_condition` contains a selection of the cells
ref_pca <- irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
proj_mat <- t(ref_pca$rotation) %*% centered_mat

Automatic integration

Harmony

One popular tool for data integration is Harmony (Korsunsky et al. 2019). Harmony is build around maximum diversity clustering Figure 3, which in addition to minimizing the distance of each data point to a cluster center also maximizes the mixing of conditions assigned to each cluster.

Figure 3: Schematic of Harmony. Screenshot from Fig. 1 of Korsunsky et al. (2019)
harm_mat <- harmony:::RunHarmony(reducedDim(sce, "PCA"), colData(sce), 
                                 vars_use = c("stim"))
Transposing data matrix
Initializing state using k-means centroids initialization
Harmony 1/10
Harmony 2/10
Harmony 3/10
Harmony 4/10
Harmony converged after 4 iterations
harm_umap <- uwot::umap(harm_mat)

as_tibble(colData(sce)) |>
  mutate(umap = harm_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

MNN

MNN is short for mutual nearest neighbors and was invented for integrating two conditions by identifying the cells which are mutually nearest neighbors Figure 4. The batchelor provides an efficient implementation which can also handle experimental designs with more than two conditions.

Figure 4: Schematic of MNN Screenshot from Fig. 1 of Haghverdi et al. (2018)
mnn_sce <- batchelor::fastMNN(sce, batch = sce$stim, BSPARAM=BiocSingular::IrlbaParam())
mnn_umap <- uwot::umap(reducedDim(mnn_sce, "corrected"))

as_tibble(colData(sce)) |>
  mutate(umap = mnn_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Invertible Integration

Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. But there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counter factual what the expression of a cell from the control condition would have been, had it been treated.

A new tool called LEMUR provides this functionality by matching the subspace of each condition (Ahlmann-Eltze and Huber 2023). Figure 5 illustrates the principle.

Figure 5: Schematic picture of data from two conditions with the respective linear subspace.

LEMUR takes as input a SingleCellExperiment object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we will use the provided cell type annotations.

fit <- lemur::lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- lemur::align_by_grouping(fit, fit$colData$cell, verbose = FALSE)
fit <- lemur::lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit <- lemur::align_harmony(fit, verbose = FALSE)

Making a UMAP plot of LEMUR’s embedding shows that it sucessfully integrated the conditions (Figure 6).

lemur_umap <- uwot::umap(t(fit$embedding))

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Figure 6: UMAP plot of LEMUR’s invertible embedding.

The advantage of the invertible integration is that we can make predictions what the expression of a cell from the control condition would have been, had it been stimulated and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.

Differential expression with an invertible integration

We call LEMUR’s test_de function to compare the expression values in the "stim" and "ctrl" conditions.

fit <- lemur::test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))

We can now pick individual genes and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values ?@fig-lemur_plot_de.

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  mutate(expr = logcounts(fit)["PLSCR1",]) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = expr), size = 0.3) +
    scale_color_viridis_c() +
    facet_wrap(vars(stim)) +
    coord_fixed()

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")["PLSCR1",]) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2() +
    coord_fixed()

Figure 7: Expression of PLSCR1 in control and stim condition

Figure 8: LEMUR’s prediction of differential expression
nei <- lemur::find_de_neighborhoods(fit, group_by = vars(stim, ind), verbose = FALSE)
as_tibble(nei) %>%
  arrange(pval)
# A tibble: 500 × 13
   name   neighborhood n_cells sel_statistic     pval adj_pval f_statistic   df1
   <chr>  <I<list>>      <int>         <dbl>    <dbl>    <dbl>       <dbl> <int>
 1 IL1RN  <chr>           5156          426. 5.28e-19 2.64e-16       1612.     1
 2 MT2A   <chr>          27129          279. 1.71e-17 4.27e-15       1088.     1
 3 NT5C3A <chr>          11862          354. 3.05e-17 5.08e-15       1019.     1
 4 EIF2A… <chr>          28815          608. 6.69e-17 8.37e-15        932.     1
 5 IFIT2  <chr>           8536          294. 1.45e-16 1.45e-14        853.     1
 6 CXCL11 <chr>           7761          372. 6.99e-16 5.41e-14        713.     1
 7 ISG20  <chr>           7988          741. 7.57e-16 5.41e-14        707.     1
 8 IFI35  <chr>           8357          495. 1.07e-15 6.01e-14        680.     1
 9 SSB    <chr>           4677          451. 1.08e-15 6.01e-14        678.     1
10 IFI16  <chr>          28137          420. 1.20e-15 6.01e-14        670.     1
# ℹ 490 more rows
# ℹ 5 more variables: df2 <dbl>, lfc <dbl>, did_pval <dbl>, did_adj_pval <dbl>,
#   did_lfc <dbl>

Session Info

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.7.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] muscData_1.14.0             ExperimentHub_2.8.0        
 [3] AnnotationHub_3.8.0         BiocFileCache_2.8.0        
 [5] dbplyr_2.3.2                SingleCellExperiment_1.22.0
 [7] SummarizedExperiment_1.30.2 Biobase_2.60.0             
 [9] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
[11] IRanges_2.34.0              S4Vectors_0.38.1           
[13] BiocGenerics_0.46.0         MatrixGenerics_1.12.2      
[15] matrixStats_1.0.0           lubridate_1.9.2            
[17] forcats_1.0.0               stringr_1.5.0              
[19] dplyr_1.1.2                 purrr_1.0.1                
[21] readr_2.1.4                 tidyr_1.3.0                
[23] tibble_3.2.1                ggplot2_3.4.4              
[25] tidyverse_2.0.0            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.14               jsonlite_1.8.8               
  [3] magrittr_2.0.3                ggbeeswarm_0.7.2             
  [5] farver_2.1.1                  rmarkdown_2.22               
  [7] zlibbioc_1.46.0               vctrs_0.6.2                  
  [9] memoise_2.0.1                 DelayedMatrixStats_1.22.0    
 [11] RCurl_1.98-1.12               htmltools_0.5.5              
 [13] S4Arrays_1.0.4                curl_5.0.1                   
 [15] BiocNeighbors_1.18.0          htmlwidgets_1.6.2            
 [17] cachem_1.0.8                  ResidualMatrix_1.10.0        
 [19] igraph_1.4.3                  mime_0.12                    
 [21] lifecycle_1.0.3               pkgconfig_2.0.3              
 [23] rsvd_1.0.5                    Matrix_1.6-4                 
 [25] R6_2.5.1                      fastmap_1.1.1                
 [27] GenomeInfoDbData_1.2.10       shiny_1.7.4                  
 [29] digest_0.6.31                 colorspace_2.1-0             
 [31] AnnotationDbi_1.62.1          scater_1.28.0                
 [33] irlba_2.3.5.1                 RSQLite_2.3.1                
 [35] beachmat_2.16.0               filelock_1.0.2               
 [37] labeling_0.4.2                fansi_1.0.4                  
 [39] timechange_0.2.0              httr_1.4.6                   
 [41] compiler_4.3.0                bit64_4.0.5                  
 [43] withr_2.5.0                   BiocParallel_1.34.2          
 [45] viridis_0.6.3                 DBI_1.1.3                    
 [47] rappdirs_0.3.3                DelayedArray_0.26.3          
 [49] lemur_1.0.5                   tools_4.3.0                  
 [51] vipor_0.4.5                   beeswarm_0.4.0               
 [53] interactiveDisplayBase_1.38.0 httpuv_1.6.11                
 [55] glmGamPoi_1.13.3              glue_1.6.2                   
 [57] batchelor_1.16.0              promises_1.2.0.1             
 [59] grid_4.3.0                    generics_0.1.3               
 [61] gtable_0.3.3                  tzdb_0.4.0                   
 [63] hms_1.1.3                     BiocSingular_1.16.0          
 [65] ScaledMatrix_1.8.1            utf8_1.2.3                   
 [67] XVector_0.40.0                RcppAnnoy_0.0.20             
 [69] ggrepel_0.9.3                 BiocVersion_3.17.1           
 [71] pillar_1.9.0                  later_1.3.1                  
 [73] splines_4.3.0                 lattice_0.21-8               
 [75] bit_4.0.5                     tidyselect_1.2.0             
 [77] Biostrings_2.68.1             scuttle_1.10.1               
 [79] knitr_1.43                    gridExtra_2.3                
 [81] RhpcBLASctl_0.23-42           xfun_0.39                    
 [83] stringi_1.7.12                yaml_2.3.7                   
 [85] evaluate_0.21                 codetools_0.2-19             
 [87] BiocManager_1.30.20           cli_3.6.1                    
 [89] uwot_0.1.14                   xtable_1.8-4                 
 [91] munsell_0.5.0                 harmony_1.2.0                
 [93] Rcpp_1.0.10                   png_0.1-8                    
 [95] parallel_4.3.0                ellipsis_0.3.2               
 [97] blob_1.2.4                    sparseMatrixStats_1.13.4     
 [99] bitops_1.0-7                  viridisLite_0.4.2            
[101] scales_1.2.1                  crayon_1.5.2                 
[103] rlang_1.1.1                   cowplot_1.1.1                
[105] KEGGREST_1.40.0              

References

Ahlmann-Eltze, Constantin, and Wolfgang Huber. 2023. “Analysis of Multi-Condition Single-Cell Data with Latent Embedding Multivariate Regression.” http://dx.doi.org/10.1101/2023.03.06.531268.
Argelaguet, Ricard, Anna S. E. Cuomo, Oliver Stegle, and John C. Marioni. 2021. “Computational Principles and Challenges in Single-Cell Data Integration.” Nature Biotechnology 39 (10): 1202–15. https://doi.org/10.1038/s41587-021-00895-7.
Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature Biotechnology 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.
Korsunsky, Ilya, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. 2019. “Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony.” Nature Methods 16 (12): 1289–96. https://doi.org/10.1038/s41592-019-0619-0.
Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2022. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.